This notebook analyzes tweets from wizkidfc collected over a one-week period. The dataset is an Excel file containing 2076 tweets across two sheets: the first holds the tweets and information about each tweet, while the second holds information about the tweep behind each tweet.
We begin by importing the necessary libraries and packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook+pdf+jupyterlab"
from wordcloud import WordCloud
from nltk.stem import WordNetLemmatizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
In the cell below, the data in the tweets sheet is imported and assigned to the variable wizkid. The cells that follow simply gather more information about the features and give us a better feel for the dataset in general.
wizkid = pd.read_excel('wizkid_tweets.xlsx', sheet_name='tweets')
wizkid.head()
| | Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | URLs | Hashtags | Mentions | Media Type | Media URLs | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1510157522514153474 | RT @Kwm913 : WIZKID LE 16 SEPTEMBRE A PANAAAAM... | Xaro | Xaro_music | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1510157521859878916 | RT @Abbye_edi__ : 30bgs have been cooking and ... | Godwinvictor5 | vkidofficial | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1510157521377546242 | RT @Cruisewithmee : There’s a reason the indus... | timsonkim | pablotimsonTN | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/iphone" r... | Retweet | NaN | 0 | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 1510157516432420865 | @heisizumichaels Davido, Wizkid, Burna boy and... | Sir Mondaylee💡 | Mondaylee | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Reply | NaN | 0 | 1 | NaN | NaN | NaN | NaN | NaN |
| 4 | 1510157513991278601 | RT @_DianaLuv : Nicki minaj Body, Wizkid menta... | 🇳🇬 OBA OF KOGI 👑 | oba_tizer | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | NaN | 0 | 0 | photo | https://pbs.twimg.com/media/FPPSC0iWUAUGN0Y.jpg | NaN | NaN | NaN |
wizkid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2076 entries, 0 to 2075
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Tweet Id     2076 non-null   int64
 1   Text         2076 non-null   object
 2   Name         2076 non-null   object
 3   Screen Name  2076 non-null   object
 4   UTC          2076 non-null   object
 5   Created At   2076 non-null   object
 6   Favorites    2076 non-null   int64
 7   Retweets     2076 non-null   int64
 8   Language     2076 non-null   object
 9   Client       2076 non-null   object
 10  Tweet Type   2076 non-null   object
 11  URLs         429 non-null    object
 12  Hashtags     2076 non-null   int64
 13  Mentions     2076 non-null   int64
 14  Media Type   637 non-null    object
 15  Media URLs   637 non-null    object
 16  Unnamed: 16  51 non-null     object
 17  Unnamed: 17  35 non-null     object
 18  Unnamed: 18  22 non-null     object
dtypes: int64(5), object(14)
memory usage: 308.3+ KB
wizkid.isnull().sum()
Tweet Id          0
Text              0
Name              0
Screen Name       0
UTC               0
Created At        0
Favorites         0
Retweets          0
Language          0
Client            0
Tweet Type        0
URLs           1647
Hashtags          0
Mentions          0
Media Type     1439
Media URLs     1439
Unnamed: 16    2025
Unnamed: 17    2041
Unnamed: 18    2054
dtype: int64
wizkid.dtypes
Tweet Id        int64
Text           object
Name           object
Screen Name    object
UTC            object
Created At     object
Favorites       int64
Retweets        int64
Language       object
Client         object
Tweet Type     object
URLs           object
Hashtags        int64
Mentions        int64
Media Type     object
Media URLs     object
Unnamed: 16    object
Unnamed: 17    object
Unnamed: 18    object
dtype: object
The unnamed features are not required, so they are removed.
wizkid.drop(['Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18'], axis=1, inplace=True)
wizkid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2076 entries, 0 to 2075
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Tweet Id     2076 non-null   int64
 1   Text         2076 non-null   object
 2   Name         2076 non-null   object
 3   Screen Name  2076 non-null   object
 4   UTC          2076 non-null   object
 5   Created At   2076 non-null   object
 6   Favorites    2076 non-null   int64
 7   Retweets     2076 non-null   int64
 8   Language     2076 non-null   object
 9   Client       2076 non-null   object
 10  Tweet Type   2076 non-null   object
 11  URLs         429 non-null    object
 12  Hashtags     2076 non-null   int64
 13  Mentions     2076 non-null   int64
 14  Media Type   637 non-null    object
 15  Media URLs   637 non-null    object
dtypes: int64(5), object(11)
memory usage: 259.6+ KB
We're not going to be working with pictures or videos, so it's better to just drop the Media URLs and URLs columns.
wizkid.drop(['Media URLs', 'URLs'], axis=1, inplace=True)
wizkid.head()
| | Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | Hashtags | Mentions | Media Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1510157522514153474 | RT @Kwm913 : WIZKID LE 16 SEPTEMBRE A PANAAAAM... | Xaro | Xaro_music | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN |
| 1 | 1510157521859878916 | RT @Abbye_edi__ : 30bgs have been cooking and ... | Godwinvictor5 | vkidofficial | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN |
| 2 | 1510157521377546242 | RT @Cruisewithmee : There’s a reason the indus... | timsonkim | pablotimsonTN | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/iphone" r... | Retweet | 0 | 0 | NaN |
| 3 | 1510157516432420865 | @heisizumichaels Davido, Wizkid, Burna boy and... | Sir Mondaylee💡 | Mondaylee | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Reply | 0 | 1 | NaN |
| 4 | 1510157513991278601 | RT @_DianaLuv : Nicki minaj Body, Wizkid menta... | 🇳🇬 OBA OF KOGI 👑 | oba_tizer | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo |
To check the total number of tweets in that period:
print('Total tweets this period: ', len(wizkid))
Total tweets this period: 2076
To begin our Exploratory Data Analysis (EDA), we first check for the most retweeted tweets in the period.
retweet_df = wizkid.sort_values(by='Retweets', ascending=False).reset_index(drop=True)
print("The top 5 most retweeted tweets", retweet_df.head(5)['Name'], retweet_df['Text'].head(5), retweet_df['Retweets'].head(5))
The top 5 most retweeted tweets 0 olyria roy ☮️ 1 Beautiful Synto 2 LERE BOY 3 Dapsy𓃵 4 B L G 🤴💫 Name: Name, dtype: object 0 Peaceful ramadan, no davido, graduate Rihanna,... 1 Forget Rihanna Beyonce and Wizkid, How deep ca... 2 tatibiji sef dey drag artiste 😂 2baba probably... 3 I wonder why it is hard for some people to rea... 4 Ease your mind and Blessed arguably has to be ... Name: Text, dtype: object 0 74 1 34 2 32 3 29 4 18 Name: Retweets, dtype: int64
It's better to put these in a dataframe, so they can be read more easily.
d = {'Name': retweet_df['Name'].head(5), 'Tweets': retweet_df['Text'].head(5), 'Retweets': retweet_df['Retweets'].head(5)}
retweet_df = pd.DataFrame(data=d)
retweet_df
| | Name | Tweets | Retweets |
|---|---|---|---|
| 0 | olyria roy ☮️ | Peaceful ramadan, no davido, graduate Rihanna,... | 74 |
| 1 | Beautiful Synto | Forget Rihanna Beyonce and Wizkid, How deep ca... | 34 |
| 2 | LERE BOY | tatibiji sef dey drag artiste 😂 2baba probably... | 32 |
| 3 | Dapsy𓃵 | I wonder why it is hard for some people to rea... | 29 |
| 4 | B L G 🤴💫 | Ease your mind and Blessed arguably has to be ... | 18 |
Yeah, it's better this way. It's even better when we visualize it as a chart, though.
fig = px.bar(retweet_df, x='Name', y='Retweets', title='Plot of tweep with respective number of retweets in the period')
fig.show()
retweet_df.query("Name == 'B L G 🤴💫'")['Tweets']
4 Ease your mind and Blessed arguably has to be ... Name: Tweets, dtype: object
Going further, another thing I'd like to check is the popularity of the different tweet clients.
The cell below checks the value count of each tweet client. From the output, we can see that Android is the most popular client among wizkid fans. You might have expected it to be iPhone, but Android wins again.
The thing that surprises me most among the tweet clients is the wizkid retweet bot, which helps to retweet posts. Shout-out to whoever created this bot; you're also using your skills to help the fandom.
tweet_client2 = wizkid['Client'].value_counts()
tweet_client2.head()
<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>    1135
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>    891
<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>    37
<a href="https://help.twitter.com/en/using-twitter/how-to-tweet#source-labels" rel="nofollow">wizkid retweet bot</a>    6
<a href="https://ifttt.com" rel="nofollow">IFTTT</a>    2
Name: Client, dtype: int64
d = {'Client':['Twitter for android', 'Twitter for iPhone', 'Twitter web app', 'wizkid retweet bot', 'IFTTT'], 'Counts': [1135, 891, 37, 6, 2]}
client_df = pd.DataFrame(data=d)
client_df.head()
| | Client | Counts |
|---|---|---|
| 0 | Twitter for android | 1135 |
| 1 | Twitter for iPhone | 891 |
| 2 | Twitter web app | 37 |
| 3 | wizkid retweet bot | 6 |
| 4 | IFTTT | 2 |
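The counts above were typed in by hand from the value_counts output. As a rough sketch, the readable label could instead be pulled out of each anchor tag with a regular expression and then counted. The helper name client_label and the three sample strings are my own assumptions for illustration (the strings are copied from the output above); this uses only the standard library.

```python
import re
from collections import Counter

# Hypothetical helper (not part of the original notebook): extract the visible
# label from a value like '<a href="..." rel="nofollow">Twitter for iPhone</a>'.
ANCHOR_TEXT = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

def client_label(raw):
    match = ANCHOR_TEXT.search(raw)
    # Fall back to the raw value when it is not wrapped in an anchor tag.
    return match.group(1).strip() if match else raw.strip()

# Sample values copied from the Client column output above.
samples = [
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
]
label_counts = Counter(client_label(s) for s in samples)
print(label_counts.most_common())
```

In the notebook itself, something like wizkid['Client'].map(client_label).value_counts() should give the same table without hard-coding the numbers, though I have not rerun it on the full dataset.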
To visualize it better, I'm gonna be using a bar chart once again
fig = px.bar(client_df, x='Client', y='Counts', title='Plot of tweet client with the count')
fig.show()
wizkid.head()
| | Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | Hashtags | Mentions | Media Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1510157522514153474 | RT @Kwm913 : WIZKID LE 16 SEPTEMBRE A PANAAAAM... | Xaro | Xaro_music | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN |
| 1 | 1510157521859878916 | RT @Abbye_edi__ : 30bgs have been cooking and ... | Godwinvictor5 | vkidofficial | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN |
| 2 | 1510157521377546242 | RT @Cruisewithmee : There’s a reason the indus... | timsonkim | pablotimsonTN | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/iphone" r... | Retweet | 0 | 0 | NaN |
| 3 | 1510157516432420865 | @heisizumichaels Davido, Wizkid, Burna boy and... | Sir Mondaylee💡 | Mondaylee | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Reply | 0 | 1 | NaN |
| 4 | 1510157513991278601 | RT @_DianaLuv : Nicki minaj Body, Wizkid menta... | 🇳🇬 OBA OF KOGI 👑 | oba_tizer | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo |
The next thing I'm going to do is analyze tweet sentiments.
Just a side note: I didn't know much about Natural Language Processing before starting this analysis, so I had to do some reading and take a quick course on it. I'm still not fully conversant with it, but I kind of know my way around it now.
For the sentiment analysis, I used TextBlob, a Python library for processing textual data. It is a high-level library built on top of NLTK.
The functions below get the subjectivity and polarity of each tweet. Subjectivity measures how much a tweet expresses personal opinion, emotion or judgement, as opposed to objective, factual information. It is a float in the range [0, 1], where 0 is very objective and 1 is very subjective.
Polarity is also a float, in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.
from textblob import TextBlob
def get_subjectivity(text):
    # 0 = very objective, 1 = very subjective
    return TextBlob(text).sentiment.subjectivity

def get_polarity(text):
    # -1 = negative, 1 = positive
    return TextBlob(text).sentiment.polarity
wizkid['subjectivity'] = wizkid['Text'].apply(get_subjectivity)
wizkid['polarity'] = wizkid['Text'].apply(get_polarity)
wizkid.head(5)
| | Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | Hashtags | Mentions | Media Type | subjectivity | polarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1510157522514153474 | RT @Kwm913 : WIZKID LE 16 SEPTEMBRE A PANAAAAM... | Xaro | Xaro_music | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 |
| 1 | 1510157521859878916 | RT @Abbye_edi__ : 30bgs have been cooking and ... | Godwinvictor5 | vkidofficial | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.350000 | -0.175 |
| 2 | 1510157521377546242 | RT @Cruisewithmee : There’s a reason the indus... | timsonkim | pablotimsonTN | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/iphone" r... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 |
| 3 | 1510157516432420865 | @heisizumichaels Davido, Wizkid, Burna boy and... | Sir Mondaylee💡 | Mondaylee | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Reply | 0 | 1 | NaN | 0.558333 | 0.450 |
| 4 | 1510157513991278601 | RT @_DianaLuv : Nicki minaj Body, Wizkid menta... | 🇳🇬 OBA OF KOGI 👑 | oba_tizer | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo | 0.000000 | 0.000 |
After getting the polarity of each tweet, it makes more sense to categorize them as positive, negative or neutral.
Tweets with a score below zero are negative, tweets with a score of exactly zero are neutral, and tweets with a score above zero are positive.
def getAnalysis(score):
    # Map a polarity score to a sentiment label.
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'
wizkid['analysis'] = wizkid['polarity'].apply(getAnalysis)
wizkid.head(5)
| | Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | Hashtags | Mentions | Media Type | subjectivity | polarity | analysis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1510157522514153474 | RT @Kwm913 : WIZKID LE 16 SEPTEMBRE A PANAAAAM... | Xaro | Xaro_music | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 | Neutral |
| 1 | 1510157521859878916 | RT @Abbye_edi__ : 30bgs have been cooking and ... | Godwinvictor5 | vkidofficial | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.350000 | -0.175 | Negative |
| 2 | 1510157521377546242 | RT @Cruisewithmee : There’s a reason the indus... | timsonkim | pablotimsonTN | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/iphone" r... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 | Neutral |
| 3 | 1510157516432420865 | @heisizumichaels Davido, Wizkid, Burna boy and... | Sir Mondaylee💡 | Mondaylee | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Reply | 0 | 1 | NaN | 0.558333 | 0.450 | Positive |
| 4 | 1510157513991278601 | RT @_DianaLuv : Nicki minaj Body, Wizkid menta... | 🇳🇬 OBA OF KOGI 👑 | oba_tizer | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo | 0.000000 | 0.000 | Neutral |
The next cell checks the value count of each tweet sentiment. Tweets with positive sentiment came out on top. 😊🎉🎉 You're happy, right? Well, I am too 😂. Let's just try to keep tweets with negative sentiment down 🤞🤞.
wizkid['analysis'].value_counts()
Positive    1163
Neutral      577
Negative     336
Name: analysis, dtype: int64
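Raw counts are a bit hard to compare at a glance, so here is a small sketch that turns them into percentage shares. The three numbers are copied from the output above; the variable names are just for illustration.

```python
# Counts copied from the value_counts() output above.
sentiment_counts = {'Positive': 1163, 'Neutral': 577, 'Negative': 336}

total = sum(sentiment_counts.values())  # should equal the 2076 tweets in the dataset
shares = {label: round(100 * n / total, 1) for label, n in sentiment_counts.items()}

for label, pct in shares.items():
    print(f'{label}: {pct}%')
```

So a little over half of the tweets are labelled positive. In pandas, value_counts(normalize=True) returns the same shares as fractions.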
As always, every analysis we do is better visualized with a chart, and a bar chart works well for this one too.
plt.title('Sentiment Analysis')
plt.xlabel('Sentiment')
plt.ylabel('Counts')
wizkid['analysis'].value_counts().plot(kind='bar')
plt.show()
The next analysis is a wordcloud based on the most popular words in the Text column. Before we can plot the wordcloud, though, we have to do some preprocessing.
The cell below converts all the tweets to lowercase.
wizkid['Text'] = wizkid['Text'].str.lower()
wizkid['Text'].tail()
2071 rt @ms_fej : 2baba will still post wizkid when... 2072 rt @jameelasosexy : what wizkid & buju bnx... 2073 rt @firstladyship : why do nigerian stans figh... 2074 rt @_asiwajulerry : you wonder why people like... 2075 rt @savvy_elijah : nobody:\ndavido patiently w... Name: Text, dtype: object
stopwordlist = ['rt', 'a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an', 'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do', 'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once', 'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're','s', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such', 't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was', 'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom', 'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre", "youve", 'your', 'yours', 'yourself', 'yourselves']
STOPWORDS = set(stopwordlist)
def cleaning_stopwords(text):
    # Keep only the words that are not in the stopword set.
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
wizkid['Text'] = wizkid['Text'].apply(lambda text: cleaning_stopwords(text))
wizkid['Text'].tail()
2071 @ms_fej : 2baba still post wizkid wins sunday!... 2072 @jameelasosexy : wizkid & buju bnxn said m... 2073 @firstladyship : nigerian stans fight over dav... 2074 @_asiwajulerry : wonder people like wizkid it’... 2075 @savvy_elijah : nobody: davido patiently waiti... Name: Text, dtype: object
Punctuation is also removed, which reduces unnecessary noise in the dataset.
import string
english_punctuations = string.punctuation
punctuations_list = english_punctuations
def cleaning_punctuations(text):
    # Build a translation table that deletes every ASCII punctuation character.
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
wizkid['Text']= wizkid['Text'].apply(lambda x: cleaning_punctuations(x))
wizkid['Text'].tail()
2071 msfej 2baba still post wizkid wins sunday one... 2072 jameelasosexy wizkid amp buju bnxn said mood ... 2073 firstladyship nigerian stans fight over david... 2074 asiwajulerry wonder people like wizkid it’s t... 2075 savvyelijah nobody davido patiently waiting w... Name: Text, dtype: object
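Notice that the curly apostrophe in it’s survives this step. string.punctuation only covers ASCII punctuation, so typographic quotes are not in the table. Here is a minimal sketch of one way to extend it; the set of extra characters is my own guess at what is worth stripping, and the function name is hypothetical.

```python
import string

# Assumption: these are the non-ASCII marks worth stripping; extend as needed.
EXTRA_PUNCTUATION = '’‘“”…'
PUNCT_TABLE = str.maketrans('', '', string.punctuation + EXTRA_PUNCTUATION)

def strip_punctuation(text):
    # Delete every character listed in the translation table.
    return text.translate(PUNCT_TABLE)

# Made-up sample modelled on the tweets above.
print(strip_punctuation('wonder people like wizkid it’s'))
```

Building the table once, outside the function, also avoids recreating it for each of the 2076 tweets.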
After that, we also try to remove repeating characters from the words.
import re
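def cleaning_repeating_char(text):
    # As written, the pattern has no backreference (\1): it matches any character
    # followed by literal "1"s, so ordinary repeated letters are left untouched.
    return re.sub(r'(.)1+', r'1', text)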
wizkid['Text'] = wizkid['Text'].apply(lambda x: cleaning_repeating_char(x))
wizkid['Text'].tail()
2071 msfej 2baba still post wizkid wins sunday one... 2072 jameelasosexy wizkid amp buju bnxn said mood ... 2073 firstladyship nigerian stans fight over david... 2074 asiwajulerry wonder people like wizkid it’s t... 2075 savvyelijah nobody davido patiently waiting w... Name: Text, dtype: object
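Words like panaaaame and ouiiiiii still show up unchanged in the later outputs, so the repeated letters are not actually being collapsed. Below is a sketch of how a backreference could do it. It is not the notebook's original code: the function name and the keep parameter are assumptions of mine. Keeping two copies rather than one avoids damaging normal words such as cooking or still.

```python
import re

def collapse_repeats(text, keep=2):
    # (.)\1{keep,} matches a character followed by at least `keep` more copies
    # of itself; the replacement keeps exactly `keep` copies of that character.
    pattern = r'(.)\1{%d,}' % keep
    return re.sub(pattern, lambda m: m.group(1) * keep, text)

print(collapse_repeats('panaaaame ouiiiiii cooking'))  # panaame ouii cooking
```

Using this in place of the cell above would change the token outputs shown later in the notebook, so those would need to be regenerated.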
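This next cell cleans URLs from all the tweets.
def cleaning_URLs(data):
    # Note: [^s] means "any character except the letter s", not "non-whitespace",
    # and the ':' and '/' in links were already stripped by the punctuation step.
    return re.sub('((www.[^s]+)|(https?://[^s]+))',' ',data)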
wizkid['Text'] = wizkid['Text'].apply(lambda x: cleaning_URLs(x))
wizkid['Text'].tail()
2071 msfej 2baba still post wizkid wins sunday one... 2072 jameelasosexy wizkid amp buju bnxn said mood ... 2073 firstladyship nigerian stans fight over david... 2074 asiwajulerry wonder people like wizkid it’s t... 2075 savvyelijah nobody davido patiently waiting w... Name: Text, dtype: object
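The tail looks the same before and after this step. As far as I can tell, that is because the links were already mangled when the punctuation was removed, and because [^s] is not the same thing as \S. Here is a sketch of a URL remover that would work if it ran before the punctuation step. The sample tweet and its links are made up for illustration.

```python
import re

# \S+ means "one or more non-whitespace characters"; the dot after www is escaped.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def remove_urls(text):
    # Replace each URL with a space, then squeeze the leftover whitespace.
    return ' '.join(URL_PATTERN.sub(' ', text).split())

# Hypothetical tweet text, not taken from the dataset.
print(remove_urls('wizkid wins sunday https://t.co/abc123 www.example.com vibes'))
# wizkid wins sunday vibes
```

The ordering matters more than the pattern here: once ':' '/' and '.' are gone, nothing in the text looks like a URL any more.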
This next cell removes numbers from the tweets.
def cleaning_numbers(data):
    # Delete every run of digits.
    return re.sub(r'[0-9]+', '', data)
wizkid['Text'] = wizkid['Text'].apply(lambda x: cleaning_numbers(x))
wizkid['Text'].tail()
2071 msfej baba still post wizkid wins sunday one ... 2072 jameelasosexy wizkid amp buju bnxn said mood ... 2073 firstladyship nigerian stans fight over david... 2074 asiwajulerry wonder people like wizkid it’s t... 2075 savvyelijah nobody davido patiently waiting w... Name: Text, dtype: object
This next cell tokenizes the cleaned tweets. Tokenization separates each sentence into its individual words.
from nltk.tokenize import RegexpTokenizer
# Raw string so that \w, \$, \d and \S reach the regex engine unchanged.
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
wizkid['Text'] = wizkid['Text'].apply(tokenizer.tokenize)
wizkid['Text'].head()
0 [kwm, wizkid, le, septembre, panaaaame, ouiiiiii] 1 [abbyeedi, bgs, cooking, throwing, yabs, burna... 2 [cruisewithmee, there, ’s, reason, industry, c... 3 [heisizumichaels, davido, wizkid, burna, boy, ... 4 [dianaluv, nicki, minaj, body, wizkid, mentali... Name: Text, dtype: object
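To see why there’s ends up split into there and ’s, it helps to try the same pattern with plain re.findall, which is (as far as I understand it) how RegexpTokenizer behaves in its default mode. This is only a sketch with a made-up sentence; the tokenize function here is a stand-in, not the NLTK one.

```python
import re

TOKEN_PATTERN = re.compile(r'\w+|\$[\d\.]+|\S+')

def tokenize(text):
    # Alternatives are tried left to right at each position: a run of word
    # characters, then a dollar amount, then any other run of non-space characters.
    return TOKEN_PATTERN.findall(text)

print(tokenize('there’s reason $9.99 wizkid'))
# ['there', '’s', 'reason', '$9.99', 'wizkid']
```

The curly apostrophe is not a word character, so \w+ stops at there and the leftover ’s is picked up by the \S+ fallback.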
Finally, we perform stemming (reducing words to their stems) and lemmatization (reducing derived words to their root form, known as the lemma).
import nltk
st = nltk.PorterStemmer()
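def stemming_on_text(data):
    text = [st.stem(word) for word in data]
    # Note: this returns the original tokens; the stemmed list in `text` is
    # discarded, which is why the output below is unchanged.
    return data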
wizkid['Text']= wizkid['Text'].apply(lambda x: stemming_on_text(x))
wizkid['Text'].head()
0 [kwm, wizkid, le, septembre, panaaaame, ouiiiiii] 1 [abbyeedi, bgs, cooking, throwing, yabs, burna... 2 [cruisewithmee, there, ’s, reason, industry, c... 3 [heisizumichaels, davido, wizkid, burna, boy, ... 4 [dianaluv, nicki, minaj, body, wizkid, mentali... Name: Text, dtype: object
lm = nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
    text = [lm.lemmatize(word) for word in data]
    # Note: as with stemming, `data` is returned and the lemmatized list is discarded.
    return data
wizkid['Text'] = wizkid['Text'].apply(lambda x: lemmatizer_on_text(x))
wizkid['Text'].head()
0 [kwm, wizkid, le, septembre, panaaaame, ouiiiiii] 1 [abbyeedi, bgs, cooking, throwing, yabs, burna... 2 [cruisewithmee, there, ’s, reason, industry, c... 3 [heisizumichaels, davido, wizkid, burna, boy, ... 4 [dianaluv, nicki, minaj, body, wizkid, mentali... Name: Text, dtype: object
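The token lists look identical before and after stemming and lemmatization, with cooking and throwing still intact. A transform over a token list only takes effect if the new list is the one returned. Here is a tiny sketch of that pattern, assuming a generic helper called apply_to_tokens and a deliberately crude toy_stem that stands in for a real stemmer (both names are mine).

```python
def apply_to_tokens(tokens, transform):
    # Build and return the NEW list; returning `tokens` would discard the work.
    return [transform(word) for word in tokens]

def toy_stem(word):
    # Stand-in for a real stemmer: strips a trailing "ing" or "s" only.
    for suffix in ('ing', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(apply_to_tokens(['cooking', 'throwing', 'yabs', 'wizkid'], toy_stem))
# ['cook', 'throw', 'yab', 'wizkid']
```

With NLTK, the same helper could be called as apply_to_tokens(tokens, st.stem) or apply_to_tokens(tokens, lm.lemmatize). Keep in mind that the lemmatizer needs the WordNet data to be downloaded first.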
Phew 😪. At long last, the preprocessing part is done.
new_tweets = " "
for tweets in wizkid.Text:
    new_tweets += " ".join(tweets) + " "
#new_tweets
So, this is the part we've been preprocessing for 🎉🎉
wc = WordCloud(max_words=1000, width=1600, height=1000, collocations=False).generate(new_tweets)
plt.imshow(wc)
print(len(new_tweets))
256119
So, here is the wordcloud (I almost typed soundcloud 😂). As we can see from the chart, wizkid is the most popular word here. It has to be (all this is about him). The next is davido, which was kind of expected, as they get compared in almost every tweet (which isn't at all necessary). I can see burna too, but boy is missing (another insight: people usually drop the boy and just call him burna). Next is the grammys, which has been the major subject in recent tweets. I feel bad he didn't win; we can go again next time 🤞.
I can also see love. wizkidfc is preaching love, and that's nice, really nice 😂.
But then I can see true, which is almost the same size as love. wizkidfc is preaching true love, and that's lovely 😂 (I hope you understand what I did here).
So, here is a wordcloud of tweets with positive sentiment. It categorizes some words wrongly, but we can still see the words that are categorized correctly.
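Reading sizes off a wordcloud is a bit subjective, so a plain frequency count is a handy cross-check. This is a sketch using collections.Counter; the token lists are made up for illustration and are not real rows from the dataset, and top_words is a hypothetical helper.

```python
from collections import Counter

def top_words(token_lists, n=5):
    # Flatten the per-tweet token lists and count every occurrence of each word.
    counts = Counter(word for tokens in token_lists for word in tokens)
    return counts.most_common(n)

# Made-up token lists for illustration only.
sample = [
    ['wizkid', 'davido', 'grammys'],
    ['wizkid', 'burna', 'love'],
    ['wizkid', 'davido', 'true', 'love'],
]
print(top_words(sample, n=3))
```

In the notebook, top_words(wizkid['Text'], n=20) should list the same words the cloud draws largest, assuming the Text column still holds the token lists at that point.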
positives = wizkid.query('analysis == "Positive"')
positive_tweets = " "
for positivess in positives.Text:
    positive_tweets += " ".join(positivess) + " "
wc = WordCloud(max_words=1000, width=1600, height=1000, collocations=False).generate(positive_tweets)
plt.imshow(wc)
print(len(positive_tweets))
152345
So, here is a wordcloud of tweets with negative sentiment. It categorizes some words wrongly, but we can still see the words that are categorized correctly.
negatives = wizkid.query('analysis == "Negative"')
negative_tweets = " "
for negativess in negatives.Text:
    negative_tweets += " ".join(negativess) + " "
wc = WordCloud(max_words=1000, width=1600, height=1000, collocations=False).generate(negative_tweets)
plt.imshow(wc)
print(len(negative_tweets))
51746
wizkid.head(5)
| | Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | Hashtags | Mentions | Media Type | subjectivity | polarity | analysis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1510157522514153474 | [kwm, wizkid, le, septembre, panaaaame, ouiiiiii] | Xaro | Xaro_music | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 | Neutral |
| 1 | 1510157521859878916 | [abbyeedi, bgs, cooking, throwing, yabs, burna... | Godwinvictor5 | vkidofficial | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.350000 | -0.175 | Negative |
| 2 | 1510157521377546242 | [cruisewithmee, there, ’s, reason, industry, c... | timsonkim | pablotimsonTN | 2022-04-02T07:29:45.000Z | Sat Apr 02 07:29:45 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/iphone" r... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 | Neutral |
| 3 | 1510157516432420865 | [heisizumichaels, davido, wizkid, burna, boy, ... | Sir Mondaylee💡 | Mondaylee | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Reply | 0 | 1 | NaN | 0.558333 | 0.450 | Positive |
| 4 | 1510157513991278601 | [dianaluv, nicki, minaj, body, wizkid, mentali... | 🇳🇬 OBA OF KOGI 👑 | oba_tizer | 2022-04-02T07:29:43.000Z | Sat Apr 02 07:29:43 +0000 2022 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo | 0.000000 | 0.000 | Neutral |
The next analysis I want to do is based on the time of the tweets. There are a lot of insights that can be drawn from here.
I'm going to convert the Created At column into a datetime column, so it is easier to work with.
wizkid['Created At'] = pd.to_datetime(wizkid['Created At'], dayfirst=True)
wizkid.head(5)
| | Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | Hashtags | Mentions | Media Type | subjectivity | polarity | analysis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1510157522514153474 | [kwm, wizkid, le, septembre, panaaaame, ouiiiiii] | Xaro | Xaro_music | 2022-04-02T07:29:45.000Z | 2022-04-02 07:29:45+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 | Neutral |
| 1 | 1510157521859878916 | [abbyeedi, bgs, cooking, throwing, yabs, burna... | Godwinvictor5 | vkidofficial | 2022-04-02T07:29:45.000Z | 2022-04-02 07:29:45+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.350000 | -0.175 | Negative |
| 2 | 1510157521377546242 | [cruisewithmee, there, ’s, reason, industry, c... | timsonkim | pablotimsonTN | 2022-04-02T07:29:45.000Z | 2022-04-02 07:29:45+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/iphone" r... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 | Neutral |
| 3 | 1510157516432420865 | [heisizumichaels, davido, wizkid, burna, boy, ... | Sir Mondaylee💡 | Mondaylee | 2022-04-02T07:29:43.000Z | 2022-04-02 07:29:43+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Reply | 0 | 1 | NaN | 0.558333 | 0.450 | Positive |
| 4 | 1510157513991278601 | [dianaluv, nicki, minaj, body, wizkid, mentali... | 🇳🇬 OBA OF KOGI 👑 | oba_tizer | 2022-04-02T07:29:43.000Z | 2022-04-02 07:29:43+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo | 0.000000 | 0.000 | Neutral |
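The raw Created At strings have a fixed layout (for example Sat Apr 02 07:29:45 +0000 2022), so the format can be spelled out explicitly instead of leaving the parser to guess. Here is a standard-library sketch of that; the constant and function names are my own, and it assumes an English locale for the day and month abbreviations.

```python
from datetime import datetime

# Layout of the raw 'Created At' strings, e.g. 'Sat Apr 02 07:29:45 +0000 2022'.
TWITTER_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def parse_created_at(raw):
    # Returns a timezone-aware datetime (the +0000 offset is kept).
    return datetime.strptime(raw, TWITTER_FORMAT)

ts = parse_created_at('Sat Apr 02 07:29:45 +0000 2022')
print(ts.isoformat(), ts.day, ts.hour, ts.strftime('%A'))
```

pd.to_datetime accepts the same directives through its format argument, which makes the parse explicit and is usually faster than inference. The dayfirst flag should not matter for strings in this layout.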
From the Created At column, I'll create two more columns: the day column and the hour column.
wizkid['day_of_tweet'] = wizkid['Created At'].dt.day
wizkid['hour_of_tweet'] = wizkid['Created At'].dt.hour
wizkid.head(5)
| | Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | Hashtags | Mentions | Media Type | subjectivity | polarity | analysis | day_of_tweet | hour_of_tweet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1510157522514153474 | [kwm, wizkid, le, septembre, panaaaame, ouiiiiii] | Xaro | Xaro_music | 2022-04-02T07:29:45.000Z | 2022-04-02 07:29:45+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 | Neutral | 2 | 7 |
| 1 | 1510157521859878916 | [abbyeedi, bgs, cooking, throwing, yabs, burna... | Godwinvictor5 | vkidofficial | 2022-04-02T07:29:45.000Z | 2022-04-02 07:29:45+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.350000 | -0.175 | Negative | 2 | 7 |
| 2 | 1510157521377546242 | [cruisewithmee, there, ’s, reason, industry, c... | timsonkim | pablotimsonTN | 2022-04-02T07:29:45.000Z | 2022-04-02 07:29:45+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/iphone" r... | Retweet | 0 | 0 | NaN | 0.000000 | 0.000 | Neutral | 2 | 7 |
| 3 | 1510157516432420865 | [heisizumichaels, davido, wizkid, burna, boy, ... | Sir Mondaylee💡 | Mondaylee | 2022-04-02T07:29:43.000Z | 2022-04-02 07:29:43+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Reply | 0 | 1 | NaN | 0.558333 | 0.450 | Positive | 2 | 7 |
| 4 | 1510157513991278601 | [dianaluv, nicki, minaj, body, wizkid, mentali... | 🇳🇬 OBA OF KOGI 👑 | oba_tizer | 2022-04-02T07:29:43.000Z | 2022-04-02 07:29:43+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo | 0.000000 | 0.000 | Neutral | 2 | 7 |
After this, we'll map each date to its day of the week.
dw_mapping = {
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}
wizkid['day_of_week_name'] = wizkid['Created At'].dt.weekday.map(dw_mapping)
After checking the days of the week, I realized that all 2076 tweets were posted on the same Saturday 😂. That's a lot for one day, really.
Because of this, there isn't much of a day-of-week pattern to extract: every tweet falls on a single day.
When I have time to come back to this, I'll try to scrape tweets that cover a longer period so that more insight can be drawn from them.
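A quick way to confirm how narrow the window is would be to count tweets per weekday and measure the gap between the earliest and latest timestamps. This is a sketch on a three-row sample: the first and last timestamps are copied from the rows displayed in this notebook, the middle one is invented, and the `demo` frame itself is hypothetical.

```python
import pandas as pd

# Small sample standing in for the full tweets frame.
demo = pd.DataFrame({
    "Created At": pd.to_datetime([
        "2022-04-02 07:09:21+00:00",
        "2022-04-02 07:15:00+00:00",  # invented row
        "2022-04-02 07:29:45+00:00",
    ])
})
demo["day_of_week_name"] = demo["Created At"].dt.day_name()

# Number of tweets that fall on each weekday.
day_counts = demo["day_of_week_name"].value_counts()

# Time between the earliest and the latest tweet in the sample.
time_span = demo["Created At"].max() - demo["Created At"].min()

print(day_counts)
print("Span covered:", time_span)
```

Running the same two computations on `wizkid` would show the real span of the dataset.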
wizkid.tail(5)
| Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | Hashtags | Mentions | Media Type | subjectivity | polarity | analysis | day_of_tweet | hour_of_tweet | day_of_week_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2071 | 1510152398886604803 | [msfej, baba, still, post, wizkid, wins, sunda... | CHIDI | fineboijoshh | 2022-04-02T07:09:23.000Z | 2022-04-02 07:09:23+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.544444 | 0.578125 | Positive | 2 | 7 | Saturday |
| 2072 | 1510152394696441863 | [jameelasosexy, wizkid, amp, buju, bnxn, said,... | MITCHELLS | callmemichaels | 2022-04-02T07:09:22.000Z | 2022-04-02 07:09:22+00:00 | 0 | 0 | et | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo | 0.000000 | 0.000000 | Neutral | 2 | 7 | Saturday |
| 2073 | 1510152394327437317 | [firstladyship, nigerian, stans, fight, over, ... | Philip | Phillbetter_ | 2022-04-02T07:09:22.000Z | 2022-04-02 07:09:22+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/iphone" r... | Retweet | 0 | 0 | NaN | 0.749796 | -0.010102 | Negative | 2 | 7 | Saturday |
| 2074 | 1510152389847830532 | [asiwajulerry, wonder, people, like, wizkid, i... | Chemical Father👑 | Victor_theplug | 2022-04-02T07:09:21.000Z | 2022-04-02 07:09:21+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.500000 | 0.500000 | Positive | 2 | 7 | Saturday |
| 2075 | 1510152388098859008 | [savvyelijah, nobody, davido, patiently, waiti... | kanmiey🥺♥️ | kanmiey | 2022-04-02T07:09:21.000Z | 2022-04-02 07:09:21+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo | 0.000000 | 0.000000 | Neutral | 2 | 7 | Saturday |
Next, I want to find out who most deserves the Wizkid fan badge for this period. I'll look at the number of tweets each user made, as well as the sentiment of those tweets.
users_df = wizkid['Name'].value_counts()
users_df.head(5)
UptownGuy🦅🦅🦅    35
LIL DURK    35
Nehza🇳🇬🦅    33
BurnaBoyFan    29
timsonkim    24
Name: Name, dtype: int64
Two accounts are tied at 35 tweets on this Saturday alone. UptownGuy is listed first (the order between tied accounts is not meaningful), so let's check him out.
uptownguy = wizkid.query("Name == 'UptownGuy🦅🦅🦅'")
uptownguy.head(5)
| Tweet Id | Text | Name | Screen Name | UTC | Created At | Favorites | Retweets | Language | Client | Tweet Type | Hashtags | Mentions | Media Type | subjectivity | polarity | analysis | day_of_tweet | hour_of_tweet | day_of_week_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74 | 1510157377554817027 | [antigravitylite, prolly, didn, ’t, know, song... | UptownGuy🦅🦅🦅 | UptownGuy8 | 2022-04-02T07:29:10.000Z | 2022-04-02 07:29:10+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.600000 | 0.1000 | Positive | 2 | 7 | Saturday |
| 242 | 1510156991263608832 | [beaustevenblog, building, strongest, wizkid, ... | UptownGuy🦅🦅🦅 | UptownGuy8 | 2022-04-02T07:27:38.000Z | 2022-04-02 07:27:38+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.000000 | 0.0000 | Neutral | 2 | 7 | Saturday |
| 308 | 1510156858459369473 | [vivianporsche, time, sunday, still, undispute... | UptownGuy🦅🦅🦅 | UptownGuy8 | 2022-04-02T07:27:07.000Z | 2022-04-02 07:27:07+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.500000 | -0.2000 | Negative | 2 | 7 | Saturday |
| 333 | 1510156802306027521 | [starboyeurope, uk, apple, music, songs, chart... | UptownGuy🦅🦅🦅 | UptownGuy8 | 2022-04-02T07:26:53.000Z | 2022-04-02 07:26:53+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | photo | 0.250000 | 0.3125 | Positive | 2 | 7 | Saturday |
| 367 | 1510156750556737542 | [mmafiaxco, wait, davido, really, unfollow, wi... | UptownGuy🦅🦅🦅 | UptownGuy8 | 2022-04-02T07:26:41.000Z | 2022-04-02 07:26:41+00:00 | 0 | 0 | en | <a href="http://twitter.com/download/android" ... | Retweet | 0 | 0 | NaN | 0.133333 | 0.1000 | Positive | 2 | 7 | Saturday |
uptownguy_tweets = " "
for tweetss in uptownguy.Text:
    uptownguy_tweets += " ".join(tweetss)
wc = WordCloud(max_words=1000, width=1600, height=1000, collocations=False).generate(uptownguy_tweets)
plt.imshow(wc)
print(len(uptownguy_tweets))
3939
Word cloud for UptownGuy's posts. He also talks a lot about Davido (we're gonna overlook that sha); apart from that, the rest of the words seem positive.
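One thing to watch in the loop that builds `uptownguy_tweets`: each tweet's tokens are joined with spaces, but nothing separates one tweet from the next, so the last word of a tweet can fuse with the first word of the following one. The sketch below shows the effect on two made-up token lists and one possible alternative. Note that switching to the alternative would change the printed length above.

```python
# Two made-up token lists standing in for rows of the Text column.
token_lists = [["wizkid", "wins"], ["sunday", "show"]]

# Pattern used above: tokens within a tweet are joined, but tweets are
# appended back to back, so "wins" and "sunday" fuse together.
fused = " "
for tokens in token_lists:
    fused += " ".join(tokens)

# Alternative: join tokens within a tweet, then join tweets with a space.
separated = " ".join(" ".join(tokens) for tokens in token_lists)

print(repr(fused))
print(repr(separated))
```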
uptownguy['analysis'].value_counts()
Positive    21
Neutral    9
Negative    5
Name: analysis, dtype: int64
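# Draw the bar chart first, then label the axes object that pandas returns,
# so the labels cannot be overwritten by the plot call.
ax = uptownguy['analysis'].value_counts().plot(kind='bar')
ax.set_title('uptownguy Analysis')
ax.set_xlabel('Sentiment')
ax.set_ylabel('Counts')
plt.show()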
This is impressive, really: 21 of UptownGuy's 35 posts are positive, which is 60% of all his posts. That's cool, UptownGuy 🎉😂
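The 60% figure is just 21 / 35. The same share can be read straight from pandas with `value_counts(normalize=True)`, and `pd.crosstab` extends the idea to every user at once. This is a minimal sketch on made-up labels for two hypothetical users (`user_a`, `user_b`); only the `Name` and `analysis` column names come from this notebook.

```python
import pandas as pd

# Made-up sentiment labels for two hypothetical users.
demo = pd.DataFrame({
    "Name": ["user_a"] * 5 + ["user_b"] * 4,
    "analysis": ["Positive", "Positive", "Positive", "Neutral", "Negative",
                 "Positive", "Neutral", "Neutral", "Negative"],
})

# Share of each sentiment label for one user (fractions that sum to 1).
user_a_share = (
    demo.loc[demo["Name"] == "user_a", "analysis"].value_counts(normalize=True)
)

# Same idea for all users: rows are users, columns are sentiment labels,
# and each row is normalized to sum to 1.
share_by_user = pd.crosstab(demo["Name"], demo["analysis"], normalize="index")

print(user_a_share)
print(share_by_user.round(2))
```

Applied to `wizkid`, a table like `share_by_user` would make it possible to compare the top tweeters on sentiment, not just on volume.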
users = pd.read_excel('wizkid_tweets.xlsx', sheet_name='users')
users.head(5)
| User Id | Name | Screen Name | UTC | Created At | Followers | Following | Favorites | Tweets | Lists | Bio | Location | URL | Verified | Default Profile | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1363640774290735104 | Xaro | Xaro_music | 2021-02-22T00:05:03.000Z | Mon Feb 22 00:05:03 +0000 2021 | 1991 | 2002 | 44872 | 30711 | 0 | Wizkid FC🦅❤🖤\n\nAfrobeat to the world🌏\n\nXvX💎... | Benin-City, Nigeria | NaN | False | True |
| 1 | 865315093541683207 | Godwinvictor5 | vkidofficial | 2017-05-18T21:16:09.000Z | Thu May 18 21:16:09 +0000 2017 | 1184 | 1040 | 239413 | 116381 | 1 | NaN | Benin-City, Nigeria | NaN | False | True |
| 2 | 996567708450918401 | timsonkim | pablotimsonTN | 2018-05-16T01:47:11.000Z | Wed May 16 01:47:11 +0000 2018 | 206 | 859 | 4738 | 35129 | 2 | Big Wiz 4 life|Blue 4 life💙 | Lagos, Nigeria | NaN | False | True |
| 3 | 239060144 | Sir Mondaylee💡 | Mondaylee | 2011-01-16T18:23:31.000Z | Sun Jan 16 18:23:31 +0000 2011 | 58536 | 43126 | 335002 | 193920 | 17 | I love dogs🐩||Personal Development Enthusiast💎... | Exactly where God wants me | https://twitter.com/search?q=from%3Amondaylee%... | False | False |
| 4 | 1346711853666344960 | 🇳🇬 OBA OF KOGI 👑 | oba_tizer | 2021-01-06T06:55:27.000Z | Wed Jan 06 06:55:27 +0000 2021 | 958 | 740 | 10970 | 13143 | 0 | DIGITAL MARKETER || ACTIVIST || PEACE ADVOCATE... | outside | NaN | False | True |
Let's check some info about UptownGuy. To do this, we query the users sheet.
uptown_info = users.query('Name == "UptownGuy🦅🦅🦅"').head(1)
uptown_info
| User Id | Name | Screen Name | UTC | Created At | Followers | Following | Favorites | Tweets | Lists | Bio | Location | URL | Verified | Default Profile | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74 | 1505566716335730688 | UptownGuy🦅🦅🦅 | UptownGuy8 | 2022-03-20T15:29:45.000Z | Sun Mar 20 15:29:45 +0000 2022 | 122 | 152 | 7875 | 5734 | 0 | God is the greatest. WiZkiD FC.. Gunner's For ... | NaN | NaN | False | True |
The query shows he has 122 followers. His bio says he's a true Wizkid fan, and he's also an Arsenal fan. I don't know if this holds in general, but I often see Wizkid fans who are also Messi fans, so let's run a quick check on that.
print(uptown_info.Bio)
74    God is the greatest. WiZkiD FC.. Gunner's For ...
Name: Bio, dtype: object
users['Bio'] = users['Bio'].apply(lambda text: cleaning_stopwords(text))
users['Bio'].tail()
2071    Free thinker. CHELSEA WIZKID
2072    I'm simple.
2073    Football Lover
2074    Coming soon🔥😎
2075    shawty widda big smile😌||Manchester United fan...
Name: Bio, dtype: object
users['Bio']= users['Bio'].apply(lambda x: cleaning_punctuations(x))
users['Bio'].tail()
2071    Free thinker CHELSEA WIZKID
2072    Im simple
2073    Football Lover
2074    Coming soon🔥😎
2075    shawty widda big smile😌Manchester United fan❤😭...
Name: Bio, dtype: object
users['Bio'] = users['Bio'].apply(lambda x: cleaning_repeating_char(x))
users['Bio'].tail()
2071    Free thinker CHELSEA WIZKID
2072    Im simple
2073    Football Lover
2074    Coming soon🔥😎
2075    shawty widda big smile😌Manchester United fan❤😭...
Name: Bio, dtype: object
users['Bio'] = users['Bio'].apply(lambda x: cleaning_URLs(x))
users['Bio'].tail()
2071    Free thinker CHELSEA WIZKID
2072    Im simple
2073    Football Lover
2074    Coming soon🔥😎
2075    shawty widda big smile😌Manchester United fan❤😭...
Name: Bio, dtype: object
users['Bio'] = users['Bio'].apply(lambda x: cleaning_numbers(x))
users['Bio'].tail()
2071    Free thinker CHELSEA WIZKID
2072    Im simple
2073    Football Lover
2074    Coming soon🔥😎
2075    shawty widda big smile😌Manchester United fan❤😭...
Name: Bio, dtype: object
users['Bio'] = users.apply(lambda row: tokenizer.tokenize(row['Bio']), axis=1)
users['Bio'].head()
0    [Wizkid, FC, 🦅❤🖤, Afrobeat, world, 🌏, XvX, 💎💙💎...
1    [nan]
2    [Big, Wiz, lifeBlue, life, 💙]
3    [I, love, dogs, 🐩Personal, Development, Enthus...
4    [DIGITAL, MARKETER, ACTIVIST, PEACE, ADVOCATE,...
Name: Bio, dtype: object
users['Bio']= users['Bio'].apply(lambda x: stemming_on_text(x))
users['Bio'].head()
0    [Wizkid, FC, 🦅❤🖤, Afrobeat, world, 🌏, XvX, 💎💙💎...
1    [nan]
2    [Big, Wiz, lifeBlue, life, 💙]
3    [I, love, dogs, 🐩Personal, Development, Enthus...
4    [DIGITAL, MARKETER, ACTIVIST, PEACE, ADVOCATE,...
Name: Bio, dtype: object
users['Bio'] = users['Bio'].apply(lambda x: lemmatizer_on_text(x))
users['Bio'].head()
0    [Wizkid, FC, 🦅❤🖤, Afrobeat, world, 🌏, XvX, 💎💙💎...
1    [nan]
2    [Big, Wiz, lifeBlue, life, 💙]
3    [I, love, dogs, 🐩Personal, Development, Enthus...
4    [DIGITAL, MARKETER, ACTIVIST, PEACE, ADVOCATE,...
Name: Bio, dtype: object
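The `[nan]` entry for user 1 suggests that a missing bio is being turned into the literal string "nan" somewhere in the cleaning steps, which would then be counted as a word in the word cloud. One way to avoid that is to fill missing bios with an empty string before cleaning. This is a sketch on a toy Series; the regex tokenizer is an assumption standing in for the notebook's `tokenizer`, which is defined earlier.

```python
import re
import pandas as pd

# Toy bios with one missing value, like user 1 above.
bios = pd.Series(["Wizkid FC", float("nan"), "Football Lover"])

# Replace missing bios with an empty string before any text cleaning,
# so they yield an empty token list instead of ["nan"].
bios_filled = bios.fillna("")

# Stand-in tokenizer: runs of word characters (a hypothetical choice).
tokens = bios_filled.apply(lambda text: re.findall(r"\w+", text))

print(tokens.tolist())
```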
users_bio = " "
for bios in users.Bio:
    users_bio += " ".join(bios)
wc = WordCloud(max_words=1000, width=1600, height=1000, collocations=False).generate(users_bio)
plt.imshow(wc)
print(len(users_bio))
108521
Ahhhhh. My guess was wrong after all: judging by the word cloud, Wizkid shares more fans with Ronaldo than with Messi. Still, it's good to see that a fair number of them are Messi fans.
Another insight here is that Wizkid fans also like Burna. That's cool, two Grammy award winners 😂.
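A word cloud only gives a visual impression, so the Ronaldo-versus-Messi comparison could be double-checked with explicit counts. Here is a minimal sketch using `collections.Counter` on made-up tokenized bios; the `bios` list is hypothetical and stands in for `users['Bio']`.

```python
from collections import Counter

# Made-up tokenized bios standing in for users['Bio'].
bios = [
    ["Wizkid", "FC", "Ronaldo"],
    ["Messi", "fan", "Wizkid"],
    ["ronaldo", "Chelsea"],
    ["Burna", "Wizkid"],
]

# Count each lowercased token once per bio, so a user who repeats a
# name in their bio is not counted twice.
bio_counts = Counter()
for tokens in bios:
    bio_counts.update({token.lower() for token in tokens})

for name in ["ronaldo", "messi", "burna"]:
    print(name, bio_counts[name])
```

Lowercasing matters here because bios mix spellings such as "WIZKID", "Wizkid" and "WiZkiD".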
So, we have finally come to the end of this. It was worthwhile, and I made some pretty interesting discoveries. Cheers 🎉🎉